Dynamic Warp Resizing in High-Performance SIMT
Authors
Abstract
Modern GPUs synchronize threads grouped in a warp at every instruction. This improves SIMD efficiency and makes it possible to share fetch and decode resources. The number of threads in each warp (the warp size) affects divergence, synchronization overhead, and the efficiency of memory access coalescing. Small warps reduce the performance penalty associated with branch and memory divergence, at the expense of reduced memory coalescing. Large warps enhance memory coalescing significantly but also increase branch and memory divergence. Dynamic workload behavior, including branch and memory divergence and coalescing, is an important factor in determining which warp size yields the best performance. The optimal warp size can vary from one workload to another, or from one program phase to the next. Based on this observation, we propose Dynamic Warp Resizing (DWR). DWR takes innovative microarchitectural steps to adjust the warp size at runtime according to program characteristics. DWR outperforms static warp size decisions by up to 1.7X to 2.28X, while imposing less than 1% area overhead. We investigate various alternative configurations and show that DWR performs better for narrower SIMD and larger caches.
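The warp-size trade-off described in the abstract can be illustrated with a small toy model. This is only a sketch under simplifying assumptions, not the DWR mechanism itself: the function names `simd_efficiency` and `memory_transactions`, the 128-byte line size, and the synthetic branch and address patterns are all hypothetical choices made for illustration. The model assumes that a divergent warp executes both branch paths serially and that accesses are coalesced only among threads of the same warp.

```python
def simd_efficiency(taken, warp_size):
    """Fraction of issued lane slots that do useful work (toy model).

    A warp whose threads disagree on the branch is assumed to execute
    both paths one after the other, occupying all of its lanes twice.
    """
    useful = 0
    issued = 0
    for start in range(0, len(taken), warp_size):
        warp = taken[start:start + warp_size]
        paths = len(set(warp))          # 1 if uniform, 2 if divergent
        useful += len(warp)
        issued += paths * len(warp)
    return useful / issued


def memory_transactions(addresses, warp_size, line_bytes=128):
    """Number of cache-line requests, coalescing only within each warp."""
    total = 0
    for start in range(0, len(addresses), warp_size):
        warp = addresses[start:start + warp_size]
        total += len({addr // line_bytes for addr in warp})
    return total


# Synthetic workload (assumed): 64 threads, the branch outcome flips
# every 8 threads, and thread t loads the 4-byte word at address 4 * t.
taken = [(t // 8) % 2 == 0 for t in range(64)]
addresses = [4 * t for t in range(64)]

for warp_size in (8, 32):
    print(warp_size,
          simd_efficiency(taken, warp_size),
          memory_transactions(addresses, warp_size))
```

On this synthetic pattern, the 8-thread warps never diverge but issue more memory requests, while the 32-thread warps coalesce better and lose half of their lane slots to divergence. Real workloads mix both effects and change over time, which is the motivation the abstract gives for adjusting warp size at runtime.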
Similar resources
A Hardware-Software Integrated Solution for Improved Single-Instruction Multi-Thread Processor Efficiency
This thesis proposes using an integrated hardware-software solution for improving Single-Instruction Multiple-Thread branching efficiency. Unlike current SIMT hardware branching architectures, this hardware-software solution allows programmers the ability to fine tune branching behavior for their application or allow the compiler to implement a generic software solution. To support a wide range...
Warp-Level Parallelism: Enabling Multiple Replications In Parallel on GPU
Stochastic simulations need multiple replications in order to build confidence intervals for their results. Even if we do not need a large number of replications, it is good practice to reduce the overall simulation time using the Multiple Replications In Parallel (MRIP) approach. This approach usually assumes access to a parallel computer such as a symmetric multiprocessing machine ...
Simty: a Synthesizable General-Purpose SIMT Processor
Simty is a massively multi-threaded processor core that dynamically assembles SIMD instructions from scalar multi-thread code. It runs the RISC-V (RV32-I) instruction set. Unlike existing SIMD or SIMT processors like GPUs, Simty takes binaries compiled for general-purpose processors without any instruction set extension or compiler changes. Simty is described in synthesizable RTL. A FPGA prototy...
Multi-tier Dynamic Vectorization for Translating GPU Optimizations into CPU Performance
Developing high performance GPU code is labor intensive. Ideally, developers could recoup high GPU development costs by generating high-performance programs for CPUs and other architectures from the same source code. However, current OpenCL compilers for non-GPUs do not fully exploit optimizations in well-tuned GPU codes. To address this problem, we develop an OpenCL implementation that efficie...
Multi-Level Cache Resizing
Hardware designers are constantly looking for ways to squeeze waste out of architectures to achieve better power efficiency. Cache resizing is a technique that can remove wasteful power consumption in caches. The idea is to determine the minimum cache a program needs to run at near-peak performance, and then reconfigure the cache to implement this efficient capacity. While there has been signif...
Journal title:
- CoRR
Volume abs/1208.2374, Issue -
Pages -
Publication date 2012